feat(linux): add AMD MI300X ROCm bootstrap #8824
Draft
wenhug wants to merge 1 commit into
Pull request overview
Adds initial end-to-end wiring for AMD MI300X ROCm enablement on AKS Linux nodes by introducing an AMD_GPU_NODE CSE signal, an AMD driver install/validation path in Ubuntu CSE, and an optional VHD prebake proof-of-concept behind an AMD_ROCM feature flag.
Changes:
- Plumbs `AMD_GPU_NODE` from AgentBaker and aks-node-controller into the Linux CSE environment and routes AMD GPU nodes through a new `ensureAmdGpuDrivers` path.
- Implements Ubuntu 24.04-specific ROCm/AMDGPU installation + validation logic (DKMS/module/device nodes + `rocminfo`/`rocm-smi` checks) and cleans up temporary apt repo configuration afterward.
- Adds a VHD prebake path gated by `AMD_ROCM` and extends the Linux VHD content tests to verify the prebaked marker/packages/module config and repo cleanup.
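
The routing described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual `cse_main.sh` logic: `selectGpuDriverPath` and `ensureNvidiaGpuDrivers` are hypothetical names, the stub bodies only echo a label, and `GPU_NODE` is assumed here as the signal for the existing NVIDIA path.

```shell
#!/bin/sh
# Sketch only: stubs stand in for the real driver-install functions.
ensureAmdGpuDrivers() { echo "amd"; }
ensureNvidiaGpuDrivers() { echo "nvidia"; }  # hypothetical name

# Check AMD_GPU_NODE first so AMD nodes never fall through to the NVIDIA path.
selectGpuDriverPath() {
    if [ "${AMD_GPU_NODE:-false}" = "true" ]; then
        ensureAmdGpuDrivers
    elif [ "${GPU_NODE:-false}" = "true" ]; then  # assumed NVIDIA signal
        ensureNvidiaGpuDrivers
    else
        echo "none"
    fi
}
```

Defaulting both variables to `false` keeps the function safe under `set -u` when neither signal is present in the environment.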
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| vhdbuilder/scripts/linux/ubuntu/tool_installs_ubuntu.sh | Adds ROCm/AMDGPU VHD prebake functions (repo setup/cleanup, module autoload, validation, install). |
| vhdbuilder/packer/test/linux-vhd-content-test.sh | Adds a new testAmdRocmPrebake gated by AMD_ROCM to validate prebaked ROCm state and repo cleanup. |
| vhdbuilder/packer/install-dependencies.sh | Wires the AMD_ROCM feature flag to invoke installAmdRocmPrebake during VHD build and logs the marker. |
| pkg/agent/variables.go | Adds an amdGpuNode CSE variable derived from EnableAMDGPU. |
| pkg/agent/variables_test.go | Adds unit tests asserting amdGpuNode string output. |
| pkg/agent/baker_test.go | Adds a Linux CSE command test asserting AMD_GPU_NODE=true for an MI300X SKU config. |
| parts/linux/cloud-init/artifacts/ubuntu/cse_install_ubuntu.sh | Adds ROCm/AMDGPU install + validation logic for Ubuntu 24.04 amd64, SKU gating, marker writing, and repo cleanup. |
| parts/linux/cloud-init/artifacts/cse_main.sh | Routes AMD_GPU_NODE=true through ensureAmdGpuDrivers before the NVIDIA driver path. |
| parts/linux/cloud-init/artifacts/cse_helpers.sh | Adds AMD ROCm-related error codes for the CSE path. |
| parts/linux/cloud-init/artifacts/cse_cmd.sh | Emits AMD_GPU_NODE into the CSE environment. |
| aks-node-controller/parser/parser.go | Adds AMD_GPU_NODE to the generated CSE environment map. |
| aks-node-controller/parser/parser_test.go | Extends parser tests to validate AMD_GPU_NODE presence/values in env. |
| aks-node-controller/parser/helper.go | Adds getEnableAmdGpu helper for config parsing. |
Comment on lines +277 to +280

    cat > /etc/apt/sources.list.d/rocm.list <<EOF
    deb [arch=amd64 signed-by=${rocm_gpg_keyring_path}] https://repo.radeon.com/rocm/apt/${rocm_version} ${ubuntu_codename} main
    deb [arch=amd64 signed-by=${rocm_gpg_keyring_path}] https://repo.radeon.com/graphics/${rocm_version}/ubuntu ${ubuntu_codename} main
    EOF

Comment on lines +353 to +357

    ensureAmdGpuDrivers() {
        local rocm_version="${AMD_ROCM_VERSION:-7.2.4}"
        local amdgpu_repo_version="${AMD_ROCM_AMDGPU_REPO_VERSION:-30.30.4}"
        local amdgpu_dkms_version="${AMD_ROCM_AMDGPU_DKMS_VERSION:-1:6.16.13.30300400-2341068.24.04}"
        local libdrm_amdgpu_dev_version="${AMD_ROCM_LIBDRM_AMDGPU_DEV_VERSION:-1:2.4.125.07020400-2341098.24.04}"

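
The `${VAR:-default}` expansions in this excerpt let a build override each pin through the environment while falling back to the defaults shown. A rough sketch of how such pins might feed an install command, assuming they are passed in apt's `package=version` form; `buildAmdPinnedPackages` is a hypothetical helper and only two of the pins are included:

```shell
# Hypothetical helper: prints pinned package specs for apt.
# Defaults are copied from the excerpt; environment variables override them.
buildAmdPinnedPackages() {
    local amdgpu_dkms_version="${AMD_ROCM_AMDGPU_DKMS_VERSION:-1:6.16.13.30300400-2341068.24.04}"
    local libdrm_amdgpu_dev_version="${AMD_ROCM_LIBDRM_AMDGPU_DEV_VERSION:-1:2.4.125.07020400-2341098.24.04}"
    echo "amdgpu-dkms=${amdgpu_dkms_version} libdrm-amdgpu-dev=${libdrm_amdgpu_dev_version}"
}

# Possible use (not executed here):
#   apt-get install -y $(buildAmdPinnedPackages)
```

Keeping the pin resolution in one function makes it easy to log or test the exact versions before anything is installed.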
Comment on lines +243 to +246

    cat > /etc/apt/sources.list.d/rocm.list <<EOF
    deb [arch=amd64 signed-by=${rocm_gpg_keyring_path}] https://repo.radeon.com/rocm/apt/${rocm_version} ${ubuntu_codename} main
    deb [arch=amd64,i386 signed-by=${rocm_gpg_keyring_path}] https://repo.radeon.com/graphics/${rocm_version}/ubuntu ${ubuntu_codename} main
    EOF

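
A sketch of the temporary-repo pattern these excerpts belong to: write the sources list, run the install, and remove the list afterward. The helper names (`writeRocmAptSources`, `cleanupRocmAptSources`) and the directory argument are assumptions for illustration, not the PR's code; the directory is parameterized so the sketch can run against a scratch directory rather than `/etc/apt/sources.list.d`, and the `deb` lines mirror the amd64-only variant.

```shell
# writeRocmAptSources <sources_dir> <keyring_path> <rocm_version> <ubuntu_codename>
# Hypothetical helper: writes rocm.list into the given directory.
writeRocmAptSources() {
    local sources_dir="$1"
    local rocm_gpg_keyring_path="$2"
    local rocm_version="$3"
    local ubuntu_codename="$4"
    cat > "${sources_dir}/rocm.list" <<EOF
deb [arch=amd64 signed-by=${rocm_gpg_keyring_path}] https://repo.radeon.com/rocm/apt/${rocm_version} ${ubuntu_codename} main
deb [arch=amd64 signed-by=${rocm_gpg_keyring_path}] https://repo.radeon.com/graphics/${rocm_version}/ubuntu ${ubuntu_codename} main
EOF
}

# cleanupRocmAptSources <sources_dir>
# Removes the temporary list; -f makes repeated calls harmless.
cleanupRocmAptSources() {
    local sources_dir="$1"
    rm -f "${sources_dir}/rocm.list"
}
```

Calling the cleanup helper from a `trap ... EXIT` in the caller would be one way to make sure the list is removed even when the install step fails partway.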
What this PR does / why we need it:
This is a draft PR for validating AMD MI300X ROCm bootstrap support in AgentBaker. It adds the first end-to-end wiring needed for an AKS node to identify itself as an AMD GPU node, install the ROCm host driver/runtime pieces during Linux CSE, and expose the node in a shape that the AMD device plugin can consume.
The CSE path added here is intentionally scoped to Ubuntu 24.04 amd64 and the MI300X SKUs we validated first:
- `Standard_ND96isr_MI300X_v5`
- `Standard_ND96is_MI300X_v5`

Main changes:

- […] `AMD_GPU_NODE` CSE environment plumbing from AgentBaker and aks-node-controller.
- […] `ensureAmdGpuDrivers` path before the existing NVIDIA driver path.
- […] `amdgpu-dkms`, `libdrm-amdgpu-dev`, `rocm-core`, `rocminfo`, `rocm-smi-lib`
- […] `amdgpu` kernel module to load on boot and removes stale `amdgpu` blacklist/install-false entries from `/etc/modprobe.d`.
- […] `modprobe amdgpu`, `/dev/kfd`, `/dev/dri/renderD*`, `rocminfo` output for `gfx942`, and `rocm-smi --showproductname` output for `AMD Instinct MI300X VF`.
- […] `repo.radeon.com`.
- […] `AMD_ROCM` feature flag and extends Linux VHD content tests to verify the prebaked ROCm marker, packages, module config, binaries, and repo cleanup.

Validation performed:

- `make generate`
- `git diff --check`
- `go test ./pkg/agent`
- `go test ./parser` from `aks-node-controller`
- […] `Standard_ND96isr_MI300X_v5` Ubuntu 24.04 VM in `francecentral` joined to an AKS cluster through AKSFlexNode.
- […] `Ready` with `amd.com/gpu=8` capacity/allocatable after the AMD device plugin was installed.
- […] `amdgpu` blacklist issue: `/dev/kfd` and `/dev/dri/renderD*` returned automatically, the flex node agent/nspawn services recovered, and the node became schedulable again.
- […] `gfx942` via `rocminfo` plus `AMD Instinct MI300X VF` via `rocm-smi`.

Draft notes / open follow-ups:

- […] `repo.radeon.com`; before production this likely needs an AKS-approved package source or mirror/cache decision.

Which issue(s) this PR fixes:
N/A
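
The output checks listed under "Main changes" can be sketched as below. This is a hedged illustration rather than the PR's implementation: `outputContains` and `validateAmdGpuNode` are hypothetical helpers, the tool output is passed in as text so the logic can be exercised without an MI300X, and the module and device-node checks (`modprobe amdgpu`, `/dev/kfd`, `/dev/dri/renderD*`) are omitted.

```shell
# Succeeds if the text in $1 contains the fixed string in $2.
outputContains() {
    printf '%s\n' "$1" | grep -qF -- "$2"
}

# validateAmdGpuNode <rocminfo_output> <rocm_smi_output>
# Returns 0 only if both outputs carry the expected MI300X markers.
validateAmdGpuNode() {
    outputContains "$1" "gfx942" || {
        echo "rocminfo: gfx942 not found" >&2
        return 1
    }
    outputContains "$2" "AMD Instinct MI300X VF" || {
        echo "rocm-smi: AMD Instinct MI300X VF not found" >&2
        return 1
    }
}
```

On a real node this could be invoked as `validateAmdGpuNode "$(rocminfo)" "$(rocm-smi --showproductname)"`, with the non-zero return mapped to one of the AMD ROCm CSE error codes.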